\section{Performance of our technique}

Unfortunately, our system (called SICL) is not yet sufficiently
finalized to allow us to make any tests of performance.  However, we
constructed a few simulations that give us some indications of the
performance of our technique compared to the technique used by PCL.

In our first test, we decided to measure the time it takes for the
generic dispatch of a simple slot reader.  This simple slot reader is
characterized by the fact that it has a single method specialized
to a single class, and that class has no subclasses.  This situation
is valid for a large number of cases, and it is therefore important to
test for.

First, we created a class with a single slot and a reader for that
slot like this:

{\small\begin{verbatim}
(defclass c () ((%x :initarg :x :reader x)))  
\end{verbatim}}

Next, we defined an instance of this class:

{\small\begin{verbatim}
(defparameter *i* (make-instance 'c :x 1))
\end{verbatim}}

Then, we created a function containing a loop where the slot reader is
called in each iteration: 

{\small\begin{verbatim}
(defun f ()
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (loop with i = *i*
        repeat 10000
        do (loop repeat 100000
                 do (x i))))}
\end{verbatim}}

In order to minimize overhead due to looping, we set the
\texttt{optimize} flags as shown.

The loop calls the reader $10^9$ times.  Since $10^9$ may not be a
fixnum on some $32$-bit platforms, we use a nested loop.  The table
below shows the result of this test on a small selection of platforms.

{\small\begin{tabular}{|c|c|c|c|c|c|}
\hline
Impl & OS & Proc & Clock & Time & Cyc\\
\hline\hline
SBCL 1.1.13 & MacOS & x86-64 & 1.8GHz & 11 & 20 \\
SBCL 1.1.16 & MacOS & x86-64 & 3.3GHz & 4.7 & 16 \\
SBCL 1.1.18 & Linux & x86-64 & 1.6GHz & 12 & 19\\
CMU 20d & Linux & x86-64 & 3GHz & 158 & 474\\
Allegro 9.0 & MacOS & x86-64 & 2.2GHz & 10 & 22\\
LispWorks 6.1.1 & Windows & x86-64 & 3GHz & 11 & 33\\
Clozure 1.9 & MacOS & x86-64 & 3.3GHz & 21 & 69\\
ABCL 1.2.1 & MacOS & x86-64 & 1.8GHz & 183 & 329\\
ABCL 1.0.1 & Linux & x86-64 & 3GHz & 152 & 456\\
\hline
\end{tabular}}

The time includes not only calling the slot reader, but also the loop
iteration overhead.  Furthermore, calling the slot reader involves not
only the generic dispatch, but also checking the argument count and
some other function-call overhead.  For now, we ignore all this
overhead.

In the table, we have included only implementations reputed to be
``high-performance'', though some interpreted implementations such as
CLISP and ECL compare favorably to the slower implementations that we
tested.  An important aspect that is not included in the table is
whether the implementation is thread safe or not.  Thread safety may
have a negative impact on performance, so implementations that are not
thread safe may look better in comparison.  Unfortunately, we do not
know which implementations are thread safe among the ones in the
table, except that we know that LispWorks 6.1.1 \emph{is} thread
safe. 

Note that the test above was designed to be as advantageous as
possible to table-based techniques in the following ways:

\begin{itemize}
\item The only instance involved will rapidly reside in the cache, so
  the additional cost of memory references is not measured.
\item There is a single class involved, allowing the implementation to
  select a better strategy for the discriminating function, as the
  paper on PCL suggests. 
\end{itemize}

To get some indication of the performance of our technique, we need to
simulate the layout of a SICL general instance.  We do that by
defining the header as a \cl{} \texttt{struct} and by using a
\emph{simple vector} for the rack.  The definition of the header looks
like this:

{\small\begin{verbatim}
(defstruct s class rack)
\end{verbatim}}

We create a simulated general instance as follows:

{\small\begin{verbatim}
(defparameter *j* 
  (let ((rack (make-array 2 :initial-contents '(10 1))))
    (make-s :class nil :rack rack)))
\end{verbatim}}

Our simulated slot reader is defined like this:

{\small\begin{verbatim}
(defun y1 (instance)
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (let* ((rack (s-rack instance))
         (stamp (svref rack 0)))
    (declare (type fixnum stamp))
    (if (= stamp 10)
        (svref rack 1)
        (error "1"))))

(proclaim '(notinline y1))
\end{verbatim}}

Finally, we define a function containing a loop that calls our
simulated slot reader:

{\small\begin{verbatim}
(defun g1 ()
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (loop with j = *j*
        repeat 1000000000
        do (y1 j)))
\end{verbatim}}

On our computer (x86-64 running at 1.6GHz), executing this function
takes less than $3.2$ seconds, which represents a significant
improvement.  Comparing to our performance table, $3.2$ seconds represent $5$
clock cycles.  Even though the test for the correct stamp in the
simulated reader always succeeds, we claim that this situation is the
usual one for this type of slot reader, since the test will fail only
in an error situation. 

The comparison above is somewhat biased in our favor because we can
not be sure that a single test will suffice in order to determine the
correct method to invoke.  For that reason, we devised the following
test: 

{\small\begin{verbatim}
(defun y2 (instance)
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (let* ((rack (s-rack instance))
         (stamp (svref rack 0)))
    (declare (type fixnum stamp))
    (cond ((> stamp 1280) (error "1"))
          ((> stamp 640)  (error "2"))
          ((> stamp 320)  (error "3"))
          ((> stamp 160)  (error "4"))
          ((> stamp 80)   (error "5"))
          ((> stamp 40)   (error "6"))
          ((> stamp 20)   (error "7"))
          ((> stamp 10)   (error "8"))
          (t (svref rack 1)))))

(proclaim '(notinline y2))

(defun g2 ()
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (loop with j = *j*
        repeat 1000000000
        do (y2 j)))
\end{verbatim}}

This test simulates a situation where the generic dispatch needs $8$
tests to determine what method to call, a situation which can be
thought of as a generic function with $256$ methods, or alternatively
a single method but where the call history contains $256$ classes with
sparse unique numbers.

On our computer, executing this function takes less than $4.7$ seconds,
which seems to suggest that the number of comparisons has only a
modest impact on performance.  In fact, the numbers suggest that
less that one additional clock cycle per additional comparison is
required.%
\footnote{In this case, however, the \emph{branch prediction} circuits
  of the processor always guess the right thing}

The following test is meant to show the impact of a very large number
of classes, so that constants are no longer very small:

{\small\begin{verbatim}
(defun y3 (instance)
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (let* ((rack (s-rack instance))
         (stamp (svref rack 0)))
    (declare (type fixnum stamp))
    (cond ((> stamp 12800000) (error "1"))
          ((> stamp 6400000)  (error "2"))
          ((> stamp 3200000)  (error "3"))
          ((> stamp 1600000)  (error "4"))
          ((> stamp 800000)   (error "5"))
          ((> stamp 400000)   (error "6"))
          ((> stamp 200000)   (error "7"))
          ((> stamp 100000)   (error "8"))
          (t (svref rack 1)))))

(proclaim '(notinline y3))

(defun g3 ()
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (loop with j = *j*
        repeat 1000000000
        do (y3 j)))
\end{verbatim}}

On our computer, executing this function takes exactly the same time to
execute as the previous one, which seems to suggest that the size of
the constants has very little impact on performance, at least on our
platform.

The final test we conducted is meant to exercise the branch-prediction
logic of the processor.  It simulates a (highly unusual) situation
where a slot reader is specialized to a class, but that class has
several subclasses, and the slot is in a different location in each
subclass.  Furthermore, the test is designed so that the branch taken
alternates for each test:

{\small\begin{verbatim}
(defun y4 (instance)
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (let* ((rack (s-rack instance))
         (stamp (svref rack 0)))
    (declare (type fixnum stamp))
    (if (< stamp 2)
        (if (< stamp 1)
            (svref rack 1)
            (svref rack 2))
        (if (< stamp 3)
            (svref rack 3)
            (svref rack 4)))))

(proclaim '(notinline y4))

(defun make (stamp)
  (let ((rack (make-array 5
                          :initial-contents
                          (cons stamp '(1 1 1 1)))))
    (make-s :class nil :rack rack)))

(defun g4 ()
  (declare (optimize (safety 0) (speed 3) (debug 0)))
  (loop with j0 = (make 0)
        with j1 = (make 1)
        with j2 = (make 2)
        with j3 = (make 3)
        repeat 250000000
        do (y4 j0) (y4 j2) (y4 j1) (y4 j3)))
\end{verbatim}}

On our computer, executing this function takes less than $3.4$ seconds
which represents a performance penalty of less than 10\% compared to
the best case where a single test with a single outcome is performed.
It should be noted, however, that this test represents an unusually
great disadvantage for our technique, and that it is not clear that
the times indicated for existing techniques in our performance table would
be as low for this situation.

The tests include loop overhead that should be subtracted from the
timing results, but this overhead is probably very small, so we can
ignore it.  Furthermore, disassembling the simulated slot readers show
that the code is very close to what we expect the SICL compiler to
emit, so there is very little overhead there as well.  

Notice, however, that the results all include the overhead of a
function call.  This function call can not be avoided, suggesting
that our results reflect the real observable improvement.  However,
in order to appreciate the performance improvement of dispatch
mechanism itself, the overhead of the function call should not be
included, which suggests that the net improvement is significantly
higher than a factor $3$. 



